NSF PAR Search | NSF Public Access Repository

On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation

Chatterjee, A; Gokhale, T; Baral, C; Yang, Y (June 2024, CVPR)

Recent advances in monocular depth estimation have been made by incorporating natural language as additional guidance. Although these methods yield impressive results, the impact of the language prior, particularly in terms of generalization and robustness, remains unexplored. In this paper, we address this gap by quantifying the impact of this prior and introducing methods to benchmark its effectiveness across various settings. We generate "low-level" sentences that convey object-centric, three-dimensional spatial relationships, incorporate them as additional language priors, and evaluate their downstream impact on depth estimation. Our key finding is that current language-guided depth estimators perform optimally only with scene-level descriptions and, counter-intuitively, fare worse with low-level descriptions. Despite leveraging additional data, these methods are not robust to directed adversarial attacks, and their performance declines as distribution shift increases. Finally, to provide a foundation for future research, we identify points of failure and offer insights to better understand these shortcomings. With an increasing number of methods using language for depth estimation, our findings highlight the opportunities and pitfalls that require careful consideration for effective deployment in real-world settings.
Full Text Available
VQA-LOL: Visual Question Answering Under the Lens of Logic

https://doi.org/10.1007/978-3-030-58589-1

Gokhale, T.; Banerjee, P.; Baral, C; Yang, Y. (November 2020, ECCV 2020: Computer Vision – ECCV 2020)
Logical connectives and their implications on the meaning of a natural language sentence are a fundamental aspect of understanding. In this paper, we investigate whether visual question answering (VQA) systems trained to answer a question about an image are able to answer the logical composition of multiple such questions. When put under this Lens of Logic, state-of-the-art VQA models have difficulty correctly answering these logically composed questions. We construct an augmentation of the VQA dataset as a benchmark, with questions containing logical compositions and linguistic transformations (negation, disjunction, conjunction, and antonyms). We propose our Lens of Logic (LOL) model, which uses question-attention and logic-attention to understand logical connectives in the question, and a novel Fréchet-Compatibility Loss, which ensures that the answers of the component questions and the composed question are consistent with the inferred logical operation. Our model shows substantial improvement in learning logical compositions while retaining performance on VQA. We suggest this work as a move towards robustness by embedding logical connectives in visual understanding.
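As a rough illustration of what consistency between component answers and a composed answer can mean, the sketch below checks a composed "yes" probability against the classical Fréchet inequalities for conjunction and disjunction. It assumes the loss named in the abstract is related to these inequalities; it is not the paper's loss or model, and the function names (`frechet_and_bounds`, `frechet_or_bounds`, `violation`) are hypothetical.

```python
def frechet_and_bounds(p, q):
    """Frechet bounds on P(A and B) given P(A)=p and P(B)=q."""
    return max(0.0, p + q - 1.0), min(p, q)


def frechet_or_bounds(p, q):
    """Frechet bounds on P(A or B) given P(A)=p and P(B)=q."""
    return max(p, q), min(1.0, p + q)


def violation(value, bounds):
    """Distance by which `value` falls outside the closed interval `bounds`.

    Returns 0.0 when the value is inside the interval. This is only an
    illustrative penalty, not the loss defined in the paper.
    """
    lower, upper = bounds
    return max(0.0, lower - value) + max(0.0, value - upper)


if __name__ == "__main__":
    # Hypothetical "yes" probabilities for two component questions.
    p_q1, p_q2 = 0.9, 0.8
    # A composed answer for "Q1 and Q2" of 0.3 is too low to be consistent.
    print(violation(0.3, frechet_and_bounds(p_q1, p_q2)))
    # A composed answer of 0.75 lies inside the bounds.
    print(violation(0.75, frechet_and_bounds(p_q1, p_q2)))
```

A penalty of this shape is zero whenever the composed answer is compatible with the component answers and grows as the composed answer moves outside the admissible interval.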
Full Text Available
An Action Language for Multi-Agent Domains: Foundations

Baral, C; Gelfond, G; Pontelli, E; Son, T (December 2019, Artificial intelligence)

In multi-agent domains (MADs), an agent's action may not just change the world and the agent's knowledge and beliefs about the world, but may also change other agents' knowledge and beliefs about the world and their knowledge and beliefs about other agents' knowledge and beliefs about the world. The goals of an agent in a multi-agent world may involve manipulating the knowledge and beliefs of other agents, and again, not just their knowledge and beliefs about the world, but also their knowledge about other agents' knowledge about the world. Our goal is to present an action language (mA+) that has the necessary features to address the above aspects in representing and reasoning about actions and change (RAC) in MADs. mA+ allows the representation of and reasoning about different types of actions that an agent can perform in a domain where many other agents might be present, such as world-altering actions, sensing actions, and announcement/communication actions. It also allows the specification of agents' dynamic awareness of action occurrences, which has future implications for what agents know about the world and about other agents' knowledge of the world. mA+ considers three different types of awareness: full awareness, partial awareness, and complete oblivion of an action occurrence and its effects. This keeps the language simple, yet powerful enough to address a large variety of knowledge manipulation scenarios in MADs. The semantics of mA+ relies on the notion of state, which is described by a pointed Kripke model and is used to encode the agents' knowledge and the real state of the world. The semantics is defined by a transition function that maps pairs of actions and states into sets of states. We illustrate properties of the action theories, including properties that guarantee finiteness of the set of initial states and their practical implementability. Finally, we relate mA+ to other formalisms that contribute to RAC in MADs.
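To make the notion of a state as a pointed Kripke model more concrete, the following minimal sketch encodes possible worlds, one accessibility relation per agent, and a designated actual world, using the standard epistemic-logic reading of "agent knows a fluent". It does not implement mA+ or its transition function; the class name, field names, and the coin example are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PointedKripkeModel:
    # world name -> set of fluents that are true in that world
    worlds: dict
    # agent name -> set of (from_world, to_world) accessibility pairs
    access: dict
    # the designated world representing the real state of the world
    actual: str

    def holds(self, fluent, world=None):
        """True if `fluent` is true in `world` (default: the actual world)."""
        w = self.actual if world is None else world
        return fluent in self.worlds[w]

    def knows(self, agent, fluent, world=None):
        """True if `fluent` holds in every world `agent` considers possible."""
        w = self.actual if world is None else world
        reachable = [v for (u, v) in self.access[agent] if u == w]
        return all(fluent in self.worlds[v] for v in reachable)


if __name__ == "__main__":
    # Hypothetical scenario: a coin lies heads up; agent A has looked, B has not.
    model = PointedKripkeModel(
        worlds={"w_heads": {"heads"}, "w_tails": set()},
        access={
            "A": {("w_heads", "w_heads"), ("w_tails", "w_tails")},
            "B": {("w_heads", "w_heads"), ("w_heads", "w_tails"),
                  ("w_tails", "w_heads"), ("w_tails", "w_tails")},
        },
        actual="w_heads",
    )
    print(model.knows("A", "heads"))  # A can distinguish the two worlds
    print(model.knows("B", "heads"))  # B cannot rule out the tails world
```

In this reading, an action such as a sensing or announcement action would map one such model to a set of updated models, which is the role the abstract assigns to the transition function.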
Full Text Available
Combining Knowledge and Reasoning through Probabilistic Soft Logic for Image Puzzle Solving

Aditya, S.; Yang, Y.; Baral, C; Aloimonos, Y (August 2018, Uncertainty in artificial intelligence)

Full Text Available
